{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "本章原文：https://usyiyi.github.io/nlp-py-2e-zh/7.html\n",
    "\n",
    "    1.我们如何能构建一个系统，从非结构化文本中提取结构化数据如表格？\n",
    "    2.有哪些稳健的方法识别一个文本中描述的实体和关系？\n",
    "    3.哪些语料库适合这项工作，我们如何使用它们来训练和评估我们的模型？\n",
    "\n",
    "**分块** 和 **命名实体识别**。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1 信息提取\n",
    "\n",
    "信息有很多种形状和大小。一个重要的形式是结构化数据：实体和关系的可预测的规范的结构。例如，我们可能对公司和地点之间的关系感兴趣。给定一个公司，我们希望能够确定它做业务的位置；反过来，给定位置，我们会想发现哪些公司在该位置做业务。如果我们的数据是表格形式，如1.1中的例子，那么回答这些问题就很简单了。\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['BBDO South', 'Georgia-Pacific']\n"
     ]
    }
   ],
   "source": [
    "locs = [('Omnicom', 'IN', 'New York'),\n",
    "        ('DDB Needham', 'IN', 'New York'),\n",
    "        ('Kaplan Thaler Group', 'IN', 'New York'),\n",
    "        ('BBDO South', 'IN', 'Atlanta'),\n",
    "        ('Georgia-Pacific', 'IN', 'Atlanta')]\n",
    "query = [e1 for (e1, rel, e2) in locs if e2=='Atlanta']\n",
    "print(query)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们可以定义一个函数,简单地连接 NLTK 中默认的句子分割器[1],分词器[2]和词性标注器[3]:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "def ie_preprocess(document):\n",
    "    sentences = nltk.sent_tokenize(document)                                # [1] 句子分割器\n",
    "    sentences = [nltk.word_tokenize(sent) for sent in sentences]  # [2] 分词器\n",
    "    sentences = [nltk.pos_tag(sent) for sent in sentences]             # [3] 词性标注器"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import nltk, re, pprint"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2 词块划分\n",
    "\n",
    "我们将用于实体识别的基本技术是词块划分，它分割和标注多词符的序列，如2.1所示。小框显示词级分词和词性标注，大框显示高级别的词块划分。每个这种较大的框叫做一个词块。就像分词忽略空白符，词块划分通常选择词符的一个子集。同样像分词一样，词块划分器生成的片段在源文本中不能重叠。\n",
    "![7.1.png](./picture/7.1.png)\n",
    "\n",
    "在本节中，我们将在较深的层面探讨词块划分，以**词块**的定义和表示开始。我们将看到**正则表达式**和**N-gram**的方法来词块划分，使用CoNLL-2000词块划分语料库**开发**和**评估词块划分器**。我们将在(5)和6回到**命名实体识别**和**关系抽取**的任务。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.1 名词短语词块划分\n",
    "\n",
    "我们将首先思考名词短语词块划分或NP词块划分任务，在那里我们寻找单独名词短语对应的词块。例如，这里是一些《华尔街日报》文本，其中的NP词块用方括号标记："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(S\n",
      "  (NP the/DT little/JJ yellow/JJ dog/NN)\n",
      "  barked/VBD\n",
      "  at/IN\n",
      "  (NP the/DT cat/NN))\n"
     ]
    }
   ],
   "source": [
    "sentence = [(\"the\", \"DT\"), (\"little\", \"JJ\"), (\"yellow\", \"JJ\"), \n",
    "            (\"dog\", \"NN\"), (\"barked\", \"VBD\"), (\"at\", \"IN\"),  (\"the\", \"DT\"), (\"cat\", \"NN\")]\n",
    "\n",
    "grammar = \"NP: {<DT>?<JJ>*<NN>}\" \n",
    "cp = nltk.RegexpParser(grammar) \n",
    "result = cp.parse(sentence) \n",
    "print(result) \n",
    "result.draw()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![7.2.png](./picture/7.2.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.2 标记模式\n",
    "\n",
    "组成一个词块语法的规则使用标记模式来描述已标注的词的序列。一个标记模式是一个词性标记序列，用尖括号分隔，如\n",
    "```\n",
    "<DT>?<JJ>*<NN>\n",
    "```\n",
    "标记模式类似于正则表达式模式（3.4）。现在，思考下面的来自《华尔街日报》的名词短语：\n",
    "```py\n",
    "another/DT sharp/JJ dive/NN\n",
    "trade/NN figures/NNS\n",
    "any/DT new/JJ policy/NN measures/NNS\n",
    "earlier/JJR stages/NNS\n",
    "Panamanian/JJ dictator/NN Manuel/NNP Noriega/NNP\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.3 用正则表达式进行词块划分\n",
    "\n",
    "要找到一个给定的句子的词块结构，RegexpParser词块划分器以一个没有词符被划分的平面结构开始。词块划分规则轮流应用，依次更新词块结构。一旦所有的规则都被调用，返回生成的词块结构。\n",
    "\n",
    "2.3显示了一个由2个规则组成的简单的词块语法。第一条规则匹配一个可选的限定词或所有格代名词，零个或多个形容词，然后跟一个名词。第二条规则匹配一个或多个专有名词。我们还定义了一个进行词块划分的例句[1]，并在此输入上运行这个词块划分器[2]。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(S\n",
      "  (NP Rapunzel/NNP)\n",
      "  let/VBD\n",
      "  down/RP\n",
      "  (NP her/PP$ long/JJ golden/JJ hair/NN))\n"
     ]
    }
   ],
   "source": [
    "grammar = r\"\"\"\n",
    "NP: {<DT|PP\\$>?<JJ>*<NN>} \n",
    "{<NNP>+}\n",
    "\"\"\"\n",
    "cp = nltk.RegexpParser(grammar)\n",
    "sentence = [(\"Rapunzel\", \"NNP\"), (\"let\", \"VBD\"), (\"down\", \"RP\"), (\"her\", \"PP$\"), (\"long\", \"JJ\"), (\"golden\", \"JJ\"), (\"hair\", \"NN\")]\n",
    "print (cp.parse(sentence))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "注意\n",
    "\n",
    "```\n",
    "$\n",
    "```\n",
    "符号是正则表达式中的一个特殊字符，必须使用反斜杠转义来匹配\n",
    "```\n",
    "PP\\$\n",
    "```\n",
    "标记。\n",
    "\n",
    "如果标记模式匹配位置重叠，最左边的匹配优先。例如，如果我们应用一个匹配两个连续的名词文本的规则到一个包含三个连续的名词的文本，则只有前两个名词将被划分："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(S (NP money/NN market/NN) fund/NN)\n"
     ]
    }
   ],
   "source": [
    "nouns = [(\"money\", \"NN\"), (\"market\", \"NN\"), (\"fund\", \"NN\")]\n",
    "grammar = \"NP: {<NN><NN>}  # Chunk two consecutive nouns\"\n",
    "cp = nltk.RegexpParser(grammar)\n",
    "print(cp.parse(nouns))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.4 探索文本语料库\n",
    "\n",
    "在2中，我们看到了我们如何在已标注的语料库中提取匹配的特定的词性标记序列的短语。我们可以使用词块划分器更容易的做同样的工作，如下："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(CHUNK combined/VBN to/TO achieve/VB)\n",
      "(CHUNK continue/VB to/TO place/VB)\n",
      "(CHUNK serve/VB to/TO protect/VB)\n"
     ]
    }
   ],
   "source": [
    "cp = nltk.RegexpParser('CHUNK: {<V.*> <TO> <V.*>}')\n",
    "brown = nltk.corpus.brown\n",
    "count = 0\n",
    "for sent in brown.tagged_sents():\n",
    "    tree = cp.parse(sent)\n",
    "    for subtree in tree.subtrees():\n",
    "        if subtree.label() == 'CHUNK': print(subtree)\n",
    "        count += 1\n",
    "        if count >= 30: break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.5 词缝加塞\n",
    "\n",
    "有时定义我们想从一个词块中排除什么比较容易。我们可以定义词缝为一个不包含在词块中的一个词符序列。在下面的例子中，barked/VBD at/IN是一个词缝：\n",
    "```\n",
    "[ the/DT little/JJ yellow/JJ dog/NN ] barked/VBD at/IN [ the/DT cat/NN ]\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.6 词块的表示：标记与树\n",
    "\n",
    "作为标注和分析之间的中间状态（8.，词块结构可以使用标记或树来表示。最广泛的文件表示使用IOB标记。在这个方案中，每个词符被三个特殊的词块标记之一标注，I（内部），O（外部）或B（开始）。一个词符被标注为B，如果它标志着一个词块的开始。块内的词符子序列被标注为I。所有其他的词符被标注为O。B和I标记后面跟着词块类型，如B-NP, I-NP。当然，没有必要指定出现在词块外的词符类型，所以这些都只标注为O。这个方案的例子如2.5所示。\n",
    "![7.3.png](./picture/7.3.png)\n",
    "\n",
    "IOB标记已成为文件中表示词块结构的标准方式，我们也将使用这种格式。下面是2.5中的信息如何出现在一个文件中的：\n",
    "```\n",
    "We PRP B-NP\n",
    "saw VBD O\n",
    "the DT B-NP\n",
    "yellow JJ I-NP\n",
    "dog NN I-NP\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3 开发和评估词块划分器\n",
    "\n",
    "现在你对分块的作用有了一些了解，但我们并没有解释如何评估词块划分器。和往常一样，这需要一个合适的已标注语料库。我们一开始寻找将IOB格式转换成NLTK树的机制，然后是使用已化分词块的语料库如何在一个更大的规模上做这个。我们将看到如何为一个词块划分器相对一个语料库的准确性打分，再看看一些数据驱动方式搜索NP词块。我们整个的重点在于扩展一个词块划分器的覆盖范围。\n",
    "## 3.1 读取IOB格式与CoNLL2000语料库\n",
    "\n",
    "使用corpus模块，我们可以加载已经标注并使用IOB符号划分词块的《华尔街日报》文本。这个语料库提供的词块类型有NP，VP和PP。正如我们已经看到的，每个句子使用多行表示，如下所示：\n",
    "```\n",
    "he PRP B-NP\n",
    "accepted VBD B-VP\n",
    "the DT B-NP\n",
    "position NN I-NP\n",
    "...\n",
    "```\n",
    "![7.4.png](./picture/7.4.png)\n",
    "我们可以使用NLTK的corpus模块访问较大量的已经划分词块的文本。CoNLL2000语料库包含27万词的《华尔街日报文本》，分为“训练”和“测试”两部分，标注有词性标记和IOB格式词块标记。我们可以使用nltk.corpus.conll2000访问这些数据。下面是一个读取语料库的“训练”部分的第100个句子的例子："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(S\n",
      "  (PP Over/IN)\n",
      "  (NP a/DT cup/NN)\n",
      "  (PP of/IN)\n",
      "  (NP coffee/NN)\n",
      "  ,/,\n",
      "  (NP Mr./NNP Stone/NNP)\n",
      "  (VP told/VBD)\n",
      "  (NP his/PRP$ story/NN)\n",
      "  ./.)\n"
     ]
    }
   ],
   "source": [
    "from nltk.corpus import conll2000\n",
    "print(conll2000.chunked_sents('train.txt')[99])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "正如你看到的，CoNLL2000语料库包含三种词块类型：NP词块，我们已经看到了；VP词块如has already delivered；PP块如because of。因为现在我们唯一感兴趣的是NP词块，我们可以使用chunk_types参数选择它们："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(S\n",
      "  Over/IN\n",
      "  (NP a/DT cup/NN)\n",
      "  of/IN\n",
      "  (NP coffee/NN)\n",
      "  ,/,\n",
      "  (NP Mr./NNP Stone/NNP)\n",
      "  told/VBD\n",
      "  (NP his/PRP$ story/NN)\n",
      "  ./.)\n"
     ]
    }
   ],
   "source": [
    "print(conll2000.chunked_sents('train.txt', chunk_types=['NP'])[99])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.2 简单的评估和基准\n",
    "\n",
    "现在，我们可以访问一个已划分词块语料，可以评估词块划分器。我们开始为没有什么意义的词块解析器cp建立一个基准，它不划分任何词块："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ChunkParse score:\n",
      "    IOB Accuracy:  43.4%%\n",
      "    Precision:      0.0%%\n",
      "    Recall:         0.0%%\n",
      "    F-Measure:      0.0%%\n"
     ]
    }
   ],
   "source": [
    "from nltk.corpus import conll2000\n",
    "cp = nltk.RegexpParser(\"\")\n",
    "test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])\n",
    "print(cp.evaluate(test_sents))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "IOB标记准确性表明超过三分之一的词被标注为O，即没有在NP词块中。然而，由于我们的标注器没有找到任何词块，其精度、召回率和F-度量均为零。现在让我们尝试一个初级的正则表达式词块划分器，查找以名词短语标记的特征字母开头的标记（如CD, DT和JJ）。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ChunkParse score:\n",
      "    IOB Accuracy:  87.7%%\n",
      "    Precision:     70.6%%\n",
      "    Recall:        67.8%%\n",
      "    F-Measure:     69.2%%\n"
     ]
    }
   ],
   "source": [
    "grammar = r\"NP: {<[CDJNP].*>+}\"\n",
    "cp = nltk.RegexpParser(grammar)\n",
    "print(cp.evaluate(test_sents))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "正如你看到的，这种方法达到相当好的结果。但是，我们可以采用更多数据驱动的方法改善它，在这里我们使用训练语料找到对每个词性标记最有可能的块标记（I, O或B）。换句话说，我们可以使用一元标注器（4）建立一个词块划分器。但不是尝试确定每个词的正确的词性标记，而是根据每个词的词性标记，尝试确定正确的词块标记。\n",
    "\n",
    "在3.1中，我们定义了UnigramChunker类，使用一元标注器给句子加词块标记。这个类的大部分代码只是用来在NLTK 的ChunkParserI接口使用的词块树表示和嵌入式标注器使用的IOB表示之间镜像转换。类定义了两个方法：一个构造函数[1]，当我们建立一个新的UnigramChunker时调用；以及parse方法[3]，用来给新句子划分词块。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "class UnigramChunker(nltk.ChunkParserI):\n",
    "    def __init__(self, train_sents): \n",
    "        train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)]\n",
    "                      for sent in train_sents]\n",
    "        self.tagger = nltk.UnigramTagger(train_data) \n",
    "\n",
    "    def parse(self, sentence): \n",
    "        pos_tags = [pos for (word,pos) in sentence]\n",
    "        tagged_pos_tags = self.tagger.tag(pos_tags)\n",
    "        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]\n",
    "        conlltags = [(word, pos, chunktag) for ((word,pos),chunktag)\n",
    "                     in zip(sentence, chunktags)]\n",
    "        return nltk.chunk.conlltags2tree(conlltags)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "构造函数[1]需要训练句子的一个列表，这将是词块树的形式。它首先将训练数据转换成适合训练标注器的形式，使用tree2conlltags映射每个词块树到一个word,tag,chunk三元组的列表。然后使用转换好的训练数据训练一个一元标注器，并存储在self.tagger供以后使用。\n",
    "\n",
    "parse方法[3]接收一个已标注的句子作为其输入，以从那句话提取词性标记开始。它然后使用在构造函数中训练过的标注器self.tagger，为词性标记标注IOB词块标记。接下来，它提取词块标记，与原句组合，产生conlltags。最后，它使用conlltags2tree将结果转换成一个词块树。\n",
    "\n",
    "现在我们有了UnigramChunker，可以使用CoNLL2000语料库训练它，并测试其表现："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ChunkParse score:\n",
      "    IOB Accuracy:  92.9%%\n",
      "    Precision:     79.9%%\n",
      "    Recall:        86.8%%\n",
      "    F-Measure:     83.2%%\n"
     ]
    }
   ],
   "source": [
    "test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])\n",
    "train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])\n",
    "unigram_chunker = UnigramChunker(train_sents)\n",
    "print(unigram_chunker.evaluate(test_sents))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这个分块器相当不错，达到整体F-度量83％的得分。让我们来看一看通过使用一元标注器分配一个标记给每个语料库中出现的词性标记，它学到了什么："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[('#', 'B-NP'), ('$', 'B-NP'), (\"''\", 'O'), ('(', 'O'), (')', 'O'), (',', 'O'), ('.', 'O'), (':', 'O'), ('CC', 'O'), ('CD', 'I-NP'), ('DT', 'B-NP'), ('EX', 'B-NP'), ('FW', 'I-NP'), ('IN', 'O'), ('JJ', 'I-NP'), ('JJR', 'B-NP'), ('JJS', 'I-NP'), ('MD', 'O'), ('NN', 'I-NP'), ('NNP', 'I-NP'), ('NNPS', 'I-NP'), ('NNS', 'I-NP'), ('PDT', 'B-NP'), ('POS', 'B-NP'), ('PRP', 'B-NP'), ('PRP$', 'B-NP'), ('RB', 'O'), ('RBR', 'O'), ('RBS', 'B-NP'), ('RP', 'O'), ('SYM', 'O'), ('TO', 'O'), ('UH', 'O'), ('VB', 'O'), ('VBD', 'O'), ('VBG', 'O'), ('VBN', 'O'), ('VBP', 'O'), ('VBZ', 'O'), ('WDT', 'B-NP'), ('WP', 'B-NP'), ('WP$', 'B-NP'), ('WRB', 'O'), ('``', 'O')]\n"
     ]
    }
   ],
   "source": [
    "postags = sorted(set(pos for sent in train_sents\n",
    "                     for (word,pos) in sent.leaves()))\n",
    "print(unigram_chunker.tagger.tag(postags))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "它已经发现大多数标点符号出现在NP词块外，除了两种货币符号#和*\\$*。它也发现限定词（DT）和所有格（PRP*\\$*和WP$）出现在NP词块的开头，而名词类型(NN, NNP, NNPS，NNS)大多出现在NP词块内。\n",
    "\n",
    "建立了一个一元分块器，很容易建立一个二元分块器：我们只需要改变类的名称为BigramChunker，修改3.1行[2]构造一个BigramTagger而不是UnigramTagger。由此产生的词块划分器的性能略高于一元词块划分器："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ChunkParse score:\n",
      "    IOB Accuracy:  93.3%%\n",
      "    Precision:     82.3%%\n",
      "    Recall:        86.8%%\n",
      "    F-Measure:     84.5%%\n"
     ]
    }
   ],
   "source": [
    "class BigramChunker(nltk.ChunkParserI):\n",
    "    def __init__(self, train_sents): \n",
    "        train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)]\n",
    "                      for sent in train_sents]\n",
    "        self.tagger = nltk.BigramTagger(train_data)\n",
    "\n",
    "    def parse(self, sentence): \n",
    "        pos_tags = [pos for (word,pos) in sentence]\n",
    "        tagged_pos_tags = self.tagger.tag(pos_tags)\n",
    "        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]\n",
    "        conlltags = [(word, pos, chunktag) for ((word,pos),chunktag)\n",
    "                     in zip(sentence, chunktags)]\n",
    "        return nltk.chunk.conlltags2tree(conlltags)\n",
    "bigram_chunker = BigramChunker(train_sents)\n",
    "print(bigram_chunker.evaluate(test_sents))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.3 训练基于分类器的词块划分器\n",
    "\n",
    "无论是基于正则表达式的词块划分器还是n-gram词块划分器，决定创建什么词块完全基于词性标记。然而，有时词性标记不足以确定一个句子应如何划分词块。例如，考虑下面的两个语句："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "class ConsecutiveNPChunkTagger(nltk.TaggerI): \n",
    "\n",
    "    def __init__(self, train_sents):\n",
    "        train_set = []\n",
    "        for tagged_sent in train_sents:\n",
    "            untagged_sent = nltk.tag.untag(tagged_sent)\n",
    "            history = []\n",
    "            for i, (word, tag) in enumerate(tagged_sent):\n",
    "                featureset = npchunk_features(untagged_sent, i, history) \n",
    "                train_set.append( (featureset, tag) )\n",
    "                history.append(tag)\n",
    "        self.classifier = nltk.MaxentClassifier.train( \n",
    "            train_set, algorithm='megam', trace=0)\n",
    "\n",
    "    def tag(self, sentence):\n",
    "        history = []\n",
    "        for i, word in enumerate(sentence):\n",
    "            featureset = npchunk_features(sentence, i, history)\n",
    "            tag = self.classifier.classify(featureset)\n",
    "            history.append(tag)\n",
    "        return zip(sentence, history)\n",
    "\n",
    "class ConsecutiveNPChunker(nltk.ChunkParserI):\n",
    "    def __init__(self, train_sents):\n",
    "        tagged_sents = [[((w,t),c) for (w,t,c) in\n",
    "                         nltk.chunk.tree2conlltags(sent)]\n",
    "                        for sent in train_sents]\n",
    "        self.tagger = ConsecutiveNPChunkTagger(tagged_sents)\n",
    "\n",
    "    def parse(self, sentence):\n",
    "        tagged_sents = self.tagger.tag(sentence)\n",
    "        conlltags = [(w,t,c) for ((w,t),c) in tagged_sents]\n",
    "        return nltk.chunk.conlltags2tree(conlltags)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "留下来唯一需要填写的是特征提取器。首先，我们定义一个简单的特征提取器，它只是提供了当前词符的词性标记。使用此特征提取器，我们的基于分类器的词块划分器的表现与一元词块划分器非常类似："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def npchunk_features(sentence, i, history):\n",
    "    word, pos = sentence[i]\n",
    "    return {\"pos\": pos}\n",
    "chunker = ConsecutiveNPChunker(train_sents)\n",
    "print(chunker.evaluate(test_sents))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "ChunkParse score:\n",
    "    IOB Accuracy:  92.9%\n",
    "    Precision:     79.9%\n",
    "    Recall:        86.7%\n",
    "    F-Measure:     83.2%\n",
    " ```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们还可以添加一个特征表示前面词的词性标记。添加此特征允许词块划分器模拟相邻标记之间的相互作用，由此产生的词块划分器与二元词块划分器非常接近。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def npchunk_features(sentence, i, history):\n",
    "    word, pos = sentence[i]\n",
    "    if i == 0:\n",
    "        prevword, prevpos = \"<START>\", \"<START>\"\n",
    "    else:\n",
    "        prevword, prevpos = sentence[i-1]\n",
    "    return {\"pos\": pos, \"prevpos\": prevpos}\n",
    "chunker = ConsecutiveNPChunker(train_sents)\n",
    "print(chunker.evaluate(test_sents))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "ChunkParse score:\n",
    "    IOB Accuracy:  93.6%\n",
    "    Precision:     81.9%\n",
    "    Recall:        87.2%\n",
    "    F-Measure:     84.5%\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "下一步，我们将尝试为当前词增加特征，因为我们假设这个词的内容应该对词块划有用。我们发现这个特征确实提高了词块划分器的表现，大约1.5个百分点（相应的错误率减少大约10％）。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def npchunk_features(sentence, i, history):\n",
    "    word, pos = sentence[i]\n",
    "    if i == 0:\n",
    "        prevword, prevpos = \"<START>\", \"<START>\"\n",
    "    else:\n",
    "        prevword, prevpos = sentence[i-1]\n",
    "    return {\"pos\": pos, \"word\": word, \"prevpos\": prevpos}\n",
    "chunker = ConsecutiveNPChunker(train_sents)\n",
    "print(chunker.evaluate(test_sents))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "ChunkParse score:\n",
    "    IOB Accuracy:  94.5%\n",
    "    Precision:     84.2%\n",
    "    Recall:        89.4%\n",
    "    F-Measure:     86.7%\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "最后，我们尝试用多种附加特征扩展特征提取器，例如预取特征[1]、配对特征[2]和复杂的语境特征[3]。这最后一个特征，称为tags-since-dt，创建一个字符串，描述自最近的限定词以来遇到的所有词性标记，或如果没有限定词则在索引i之前自语句开始以来遇到的所有词性标记。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [],
   "source": [
    "def npchunk_features(sentence, i, history):\n",
    "    word, pos = sentence[i]\n",
    "    if i == 0:\n",
    "        prevword, prevpos = \"<START>\", \"<START>\"\n",
    "    else:\n",
    "        prevword, prevpos = sentence[i-1]\n",
    "    if i == len(sentence)-1:\n",
    "        nextword, nextpos = \"<END>\", \"<END>\"\n",
    "    else:\n",
    "        nextword, nextpos = sentence[i+1]\n",
    "    return {\"pos\": pos,\n",
    "            \"word\": word,\n",
    "            \"prevpos\": prevpos,\n",
    "            \"nextpos\": nextpos,\n",
    "            \"prevpos+pos\": \"%s+%s\" % (prevpos, pos),  \n",
    "            \"pos+nextpos\": \"%s+%s\" % (pos, nextpos),\n",
    "            \"tags-since-dt\": tags_since_dt(sentence, i)}  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [],
   "source": [
    "def tags_since_dt(sentence, i):\n",
    "    tags = set()\n",
    "    for word, pos in sentence[:i]:\n",
    "        if pos == 'DT':\n",
    "            tags = set()\n",
    "        else:\n",
    "            tags.add(pos)\n",
    "    return '+'.join(sorted(tags))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "chunker = ConsecutiveNPChunker(train_sents)\n",
    "print(chunker.evaluate(test_sents))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "ChunkParse score:\n",
    "    IOB Accuracy:  96.0%\n",
    "    Precision:     88.6%\n",
    "    Recall:        91.0%\n",
    "    F-Measure:     89.8%\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 4 语言结构中的递归\n",
    "## 4.1 用级联词块划分器构建嵌套结构\n",
    "\n",
    "到目前为止，我们的词块结构一直是相对平的。已标注词符组成的树在如NP这样的词块节点下任意组合。然而，只需创建一个包含递归规则的多级的词块语法，就可以建立任意深度的词块结构。4.1是名词短语、介词短语、动词短语和句子的模式。这是一个四级词块语法器，可以用来创建深度最多为4的结构。\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(S\n",
      "  (NP Mary/NN)\n",
      "  saw/VBD\n",
      "  (CLAUSE\n",
      "    (NP the/DT cat/NN)\n",
      "    (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))\n"
     ]
    }
   ],
   "source": [
    "grammar = r\"\"\"\n",
    "  NP: {<DT|JJ|NN.*>+}          \n",
    "  PP: {<IN><NP>}               \n",
    "  VP: {<VB.*><NP|PP|CLAUSE>+$} \n",
    "  CLAUSE: {<NP><VP>}           \n",
    "  \"\"\"\n",
    "cp = nltk.RegexpParser(grammar)\n",
    "sentence = [(\"Mary\", \"NN\"), (\"saw\", \"VBD\"), (\"the\", \"DT\"), (\"cat\", \"NN\"),\n",
    "    (\"sit\", \"VB\"), (\"on\", \"IN\"), (\"the\", \"DT\"), (\"mat\", \"NN\")]\n",
    "print(cp.parse(sentence))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "不幸的是，这一结果丢掉了saw为首的VP。它还有其他缺陷。当我们将此词块划分器应用到一个有更深嵌套的句子时，让我们看看会发生什么。请注意，它无法识别[1]开始的VP词块。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(S\n",
      "  (NP John/NNP)\n",
      "  thinks/VBZ\n",
      "  (NP Mary/NN)\n",
      "  saw/VBD\n",
      "  (CLAUSE\n",
      "    (NP the/DT cat/NN)\n",
      "    (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))\n"
     ]
    }
   ],
   "source": [
    "sentence = [(\"John\", \"NNP\"), (\"thinks\", \"VBZ\"), (\"Mary\", \"NN\"),\n",
    "    (\"saw\", \"VBD\"), (\"the\", \"DT\"), (\"cat\", \"NN\"), (\"sit\", \"VB\"),\n",
    "    (\"on\", \"IN\"), (\"the\", \"DT\"), (\"mat\", \"NN\")]\n",
    "print(cp.parse(sentence))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这些问题的解决方案是让词块划分器在它的模式中循环：尝试完所有模式之后，重复此过程。我们添加一个可选的第二个参数loop指定这套模式应该循环的次数："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(S\n",
      "  (NP John/NNP)\n",
      "  thinks/VBZ\n",
      "  (CLAUSE\n",
      "    (NP Mary/NN)\n",
      "    (VP\n",
      "      saw/VBD\n",
      "      (CLAUSE\n",
      "        (NP the/DT cat/NN)\n",
      "        (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))))\n"
     ]
    }
   ],
   "source": [
    "cp = nltk.RegexpParser(grammar, loop=2)\n",
    "print(cp.parse(sentence))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "注意\n",
    "\n",
    "这个级联过程使我们能创建深层结构。然而，创建和调试级联过程是困难的，关键点是它能更有效地做全面的分析（见第8.章）。另外，级联过程只能产生固定深度的树（不超过级联级数），完整的句法分析这是不够的。\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.2 Trees\n",
    "\n",
    "tree是一组连接的加标签节点，从一个特殊的根节点沿一条唯一的路径到达每个节点。下面是一棵树的例子（注意它们标准的画法是颠倒的）：\n",
    "```\n",
    "(S\n",
    "   (NP Alice)\n",
    "   (VP\n",
    "      (V chased)\n",
    "      (NP\n",
    "         (Det the)\n",
    "         (N rabbit))))\n",
    "```\n",
    "虽然我们将只集中关注语法树，树可以用来编码任何同构的超越语言形式序列的层次结构（如形态结构、篇章结构）。一般情况下，叶子和节点值不一定要是字符串。\n",
    "\n",
    "在NLTK中，我们通过给一个节点添加标签和一系列的孩子创建一棵树："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(NP Alice)\n",
      "(NP the rabbit)\n"
     ]
    }
   ],
   "source": [
    "tree1 = nltk.Tree('NP', ['Alice'])\n",
    "print(tree1)\n",
    "tree2 = nltk.Tree('NP', ['the', 'rabbit'])\n",
    "print(tree2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们可以将这些不断合并成更大的树，如下所示："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(S (NP Alice) (VP chased (NP the rabbit)))\n"
     ]
    }
   ],
   "source": [
    "tree3 = nltk.Tree('VP', ['chased', tree2])\n",
    "tree4 = nltk.Tree('S', [tree1, tree3])\n",
    "print(tree4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "下面是树对象的一些的方法："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(VP chased (NP the rabbit))\n",
      "VP\n",
      "['Alice', 'chased', 'the', 'rabbit']\n",
      "rabbit\n"
     ]
    }
   ],
   "source": [
    "print(tree4[1])\n",
    "print(tree4[1].label())\n",
    "print(tree4.leaves())\n",
    "print(tree4[1][1][1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "复杂的树用括号表示难以阅读。在这些情况下，draw方法是非常有用的。它会打开一个新窗口，包含树的一个图形表示。树显示窗口可以放大和缩小，子树可以折叠和展开，并将图形表示输出为一个postscript文件（包含在一个文档中）。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [],
   "source": [
    "tree3.draw()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![7.5.png](./picture/7.5.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.3 树遍历\n",
    "\n",
    "使用递归函数来遍历树是标准的做法。4.2中的内容进行了演示。\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "( S ( NP Alice ) ( VP chased ( NP the rabbit ) ) ) "
     ]
    }
   ],
   "source": [
    "def traverse(t):\n",
    "    try:\n",
    "        t.label()\n",
    "    except AttributeError:\n",
    "        print(t, end=\" \")\n",
    "    else:\n",
    "        # Now we know that t.node is defined\n",
    "        print('(', t.label(), end=\" \")\n",
    "        for child in t:\n",
    "            traverse(child)\n",
    "        print(')', end=\" \")\n",
    "\n",
    "t = tree4\n",
    "traverse(t)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5 命名实体识别\n",
    "\n",
    "在本章开头，我们简要介绍了命名实体（NE）。命名实体是确切的名词短语，指示特定类型的个体，如组织、人、日期等。5.1列出了一些较常用的NE类型。这些应该是不言自明的，除了“FACILITY”：建筑和土木工程领域的人造产品；以及“GPE”：地缘政治实体，如城市、州/省、国家。\n",
    "\n",
    "\n",
    "常用命名实体类型\n",
    "```\n",
    "Eddy N B-PER\n",
    "Bonte N I-PER\n",
    "is V O\n",
    "woordvoerder N O\n",
    "van Prep O\n",
    "diezelfde Pron O\n",
    "Hogeschool N B-ORG\n",
    ". Punc O\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(S\n",
      "  From/IN\n",
      "  what/WDT\n",
      "  I/PPSS\n",
      "  was/BEDZ\n",
      "  able/JJ\n",
      "  to/IN\n",
      "  gauge/NN\n",
      "  in/IN\n",
      "  a/AT\n",
      "  swift/JJ\n",
      "  ,/,\n",
      "  greedy/JJ\n",
      "  glance/NN\n",
      "  ,/,\n",
      "  the/AT\n",
      "  figure/NN\n",
      "  inside/IN\n",
      "  the/AT\n",
      "  coral-colored/JJ\n",
      "  boucle/NN\n",
      "  dress/NN\n",
      "  was/BEDZ\n",
      "  stupefying/VBG\n",
      "  ./.)\n"
     ]
    }
   ],
   "source": [
    "print(nltk.ne_chunk(sent)) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 6 关系抽取\n",
    "\n",
    "一旦文本中的命名实体已被识别，我们就可以提取它们之间存在的关系。如前所述，我们通常会寻找指定类型的命名实体之间的关系。进行这一任务的方法之一是首先寻找所有X, α, Y)形式的三元组，其中X和Y是指定类型的命名实体，α表示X和Y之间关系的字符串。然后我们可以使用正则表达式从α的实体中抽出我们正在查找的关系。下面的例子搜索包含词in的字符串。特殊的正则表达式(?!\\b.+ing\\b)是一个否定预测先行断言，允许我们忽略如success in supervising the transition of中的字符串，其中in后面跟一个动名词。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']\n",
      "[ORG: 'McGlashan &AMP; Sarrail'] 'firm in' [LOC: 'San Mateo']\n",
      "[ORG: 'Freedom Forum'] 'in' [LOC: 'Arlington']\n",
      "[ORG: 'Brookings Institution'] ', the research group in' [LOC: 'Washington']\n",
      "[ORG: 'Idealab'] ', a self-described business incubator based in' [LOC: 'Los Angeles']\n",
      "[ORG: 'Open Text'] ', based in' [LOC: 'Waterloo']\n",
      "[ORG: 'WGBH'] 'in' [LOC: 'Boston']\n",
      "[ORG: 'Bastille Opera'] 'in' [LOC: 'Paris']\n",
      "[ORG: 'Omnicom'] 'in' [LOC: 'New York']\n",
      "[ORG: 'DDB Needham'] 'in' [LOC: 'New York']\n",
      "[ORG: 'Kaplan Thaler Group'] 'in' [LOC: 'New York']\n",
      "[ORG: 'BBDO South'] 'in' [LOC: 'Atlanta']\n",
      "[ORG: 'Georgia-Pacific'] 'in' [LOC: 'Atlanta']\n"
     ]
    }
   ],
   "source": [
    "IN = re.compile(r'.*\\bin\\b(?!\\b.+ing)')\n",
    "for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):\n",
    "    for rel in nltk.sem.extract_rels('ORG', 'LOC', doc,\n",
    "                                     corpus='ieer', pattern = IN):\n",
    "        print(nltk.sem.rtuple(rel))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "搜索关键字in执行的相当不错，虽然它的检索结果也会误报，例如[ORG: House Transportation Committee] , secured the most money in the [LOC: New York]；一种简单的基于字符串的方法排除这样的填充字符串似乎不太可能。\n",
    "\n",
    "如前文所示，conll2002命名实体语料库的荷兰语部分不只包含命名实体标注，也包含词性标注。这允许我们设计对这些标记敏感的模式，如下面的例子所示。clause()方法以分条形式输出关系，其中二元关系符号作为参数relsym的值被指定[1]。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "VAN(\"cornet_d'elzius\", 'buitenlandse_handel')\n",
      "VAN('johan_rottiers', 'kardinaal_van_roey_instituut')\n",
      "VAN('annie_lennox', 'eurythmics')\n"
     ]
    }
   ],
   "source": [
    "from nltk.corpus import conll2002\n",
    "vnv = \"\"\"\n",
    "(\n",
    "is/V|    # 3rd sing present and\n",
    "was/V|   # past forms of the verb zijn ('be')\n",
    "werd/V|  # and also present\n",
    "wordt/V  # past of worden ('become)\n",
    ")\n",
    ".*       # followed by anything\n",
    "van/Prep # followed by van ('of')\n",
    "\"\"\"\n",
    "VAN = re.compile(vnv, re.VERBOSE)\n",
    "for doc in conll2002.chunked_sents('ned.train'):\n",
    "    for r in nltk.sem.extract_rels('PER', 'ORG', doc,\n",
    "                                   corpus='conll2002', pattern=VAN):\n",
    "        print(nltk.sem.clause(r, relsym=\"VAN\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
